In recent years, global attention has shifted heavily toward healthcare, with the COVID-19 pandemic as the catalyst. Every country has been affected to some degree, and many aspects of everyday life have not been the same since. Data analytics has been widely used in the global fight against COVID-19, both for contact tracing and for tracking the general trend of infections in a country. For instance, a 2021 study stated that analyzing health data in real time using AI techniques will play a vital role in predictive and preventive healthcare (Alsunaidi et al., 2021). In this report, I explore two facets of a country that are impacted by health outbreaks, namely tourism and the population death rate, by visualising the data and drawing insights from observable trends. These trends can indicate how well a country has been managed during difficult events. I have not found any other analysis online that uses a similar set of indicators to evaluate a country's response to health outbreaks.
I will focus on indicators for Singapore over a 22-year period (2000-2021), during which the country faced the Severe Acute Respiratory Syndrome (SARS) outbreak (2003-2004), the Influenza A (H1N1) pandemic (2009) and the ongoing COVID-19 pandemic.
For this analysis, the aims are outlined below:
The objective is to measure the impact of mass health outbreaks on tourism and the population death rate. Both indicators are directly affected when a health outbreak occurs. A country's borders may be tightened, or its international perception may suffer, lowering tourist arrivals. An outbreak may also increase mortality, which shows up in the population death rate. Fluctuations in these two indicators are therefore useful measures of a country's incident response management. A few questions that can be asked during analysis are:
How have tourist arrivals to Singapore been affected during pandemic periods?
How did pandemic periods affect the population death rate?
Which age groups were most affected according to death rate?
Which pandemic can be observed to be the most severe?
Data from WorldData (https://www.worlddata.info/) and the Singapore Department of Statistics (DOS) (https://www.singstat.gov.sg/) were chosen because their time series cover the periods of the above-mentioned pandemics. WorldData provides the tourism indicator, while DOS provides the population death rate indicator. Where data from WorldData require supplementing, DOS is used. Both sources are considered reliable: WorldData states that its data are based on information from the United Nations World Tourism Organisation (UNWTO), while DOS is the official governmental statistics department of Singapore. WorldData is web scraped, and data from DOS are downloaded in csv format.
For comparison, Kaggle (https://www.kaggle.com/) was also considered as a data source. However, as data can be uploaded by anyone on that platform, the sources may lack transparency and the data trail might not be verifiable. Data from Kaggle might therefore not be as accurate as data from WorldData and DOS.
WorldData currently has no tourism data for 2021, although the COVID-19 pandemic has stretched well into 2021. The solution is to supplement with data from DOS.
Although 3 pandemics in the span of 22 years is epidemiologically significant, the small number of occurrences as data points may restrict the richness of the insights that can be gathered. More limitations and constraints will be elaborated on during the analysis.
If you need this data for school or university, but do not earn money and do not spread it otherwise, you are welcome to use them.
Subject to these Terms of Use, we grant you free, worldwide, perpetual and non-exclusive use of the Contents made available on this Website for the purpose of (a) copying, distribution or transmission of the Contents and (b) using the Contents to develop or derive, for sale or otherwise, any products and services or to resell the Contents in any form to any Third Party, provided that you: a. credit the source of the Contents;
b. use the Contents in a way that is legal and in accordance with all applicable laws;
c. ensure that no analysis or transformation of the Contents may be presented in a manner which suggests or is likely to lead to the belief that the analysis or transformation of the Contents is attributed to us;
d. cease to use the Contents and remove them from your applications or websites upon our request in the event that the Contents are no longer provided on this Website or of a breach of any of these Terms of Use;
e. ensure the datasets and data in the Contents are accurately reproduced; and
f. do not use the Contents in a way that suggests we are associated to you or we endorse you or your use of the Contents.
The data used in this report have been assessed to pose minimal harm (if any) to individuals and organisations, as they contain no individually identifying information and originate from governmental data sources. The analysis might have the potential to form new intellectual property (IP), since the findings could be considered a basic measure of a country's performance in response to health outbreaks. However, I would like to state that the findings in this report are not new IP and are based on specific transformations of data for the purpose of this report.
This report strives to be factual and objective on the above stated indicators of Singapore during health outbreaks. The analysis will not claim to be an ultimate representation of Singapore's performance on pandemic management. Conclusions drawn will be neutral and report objectively based on the data findings.
If anyone wishes to use the data sources in this report, they have to separately abide by the clauses that govern these respective data sources. Use of data transformation or findings from this report will also be governed by the same clauses from the original sources. The transformation and analysis of the data are done by me and in no way attributed to the original data sources.
Here, data will be scraped from WorldData pertaining to tourist arrivals in Singapore. The below code scrapes the data in a few seconds:
#web scraping
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.worlddata.info/asia/singapore/tourism.php")
soup = BeautifulSoup(page.content, 'html.parser')
scraped_data = [s.get_text() for s in soup.find_all('tr', class_='right')]
print(scraped_data)
['20202.74 m', '201919.12 m', '201818.51 m20.42 bn $5.4 %1,103 $', '201717.43 m19.89 bn $5.8 %1,142 $', '201616.40 m18.94 bn $5.9 %1,155 $', '201515.23 m16.62 bn $5.4 %1,091 $', '201415.10 m19.16 bn $6.1 %1,269 $', '201315.57 m19.23 bn $6.3 %1,235 $', '201214.50 m18.80 bn $6.4 %1,297 $', '201113.17 m17.93 bn $6.4 %1,361 $', '201011.64 m14.18 bn $5.9 %1,218 $', '20099.68 m9.23 bn $4.8 %953 $', '200810.12 m10.62 bn $5.5 %1,049 $', '200710.29 m9.07 bn $5.0 %881 $', '20069.75 m7.54 bn $5.1 %773 $', '20058.94 m6.21 bn $4.9 %694 $', '20048.33 m5.33 bn $4.6 %640 $', '20036.13 m3.84 bn $3.9 %627 $', '20027.57 m4.46 bn $4.8 %589 $', '20017.52 m4.64 bn $5.2 %617 $', '20007.69 m5.14 bn $5.4 %669 $', '19996.96 m5.09 bn $5.9 %731 $', '19986.24 m4.60 bn $5.4 %737 $', '19977.20 m6.33 bn $6.3 %879 $', '19967.29 m7.40 bn $7.7 %1,015 $', '19957.14 m7.61 bn $8.7 %1,066 $']
Save the scraped data into a csv file for later use:
#save scraped data to csv
import csv
with open("scraped_data.csv","w") as f:
    wr = csv.writer(f,delimiter="\n")
    wr.writerow(scraped_data)
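The `delimiter="\n"` approach stores every scraped string as a field of one single record. A minimal alternative sketch, assuming one CSV record per scraped row is acceptable, is given here. It is not the code used for this report: `io.StringIO` stands in for the file, and `sample_rows` holds two strings copied from the scraped output above.

```python
import csv
import io

# Two rows copied from the scraped output; the second contains a comma.
sample_rows = ['20202.74 m', '201818.51 m20.42 bn $5.4 %1,103 $']

# newline="" leaves line endings to the csv module, as its documentation advises.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for row in sample_rows:
    # One record per scraped row; csv quotes the field that contains a comma.
    writer.writerow([row])

# Read the records back and skip any empty ones.
buffer.seek(0)
restored = [record[0] for record in csv.reader(buffer) if record]
print(restored)
```

With a real file, the same idea would use `open(path, "w", newline="")` for writing and `open(path, newline="")` for reading.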
Import the previously scraped data from the csv file and clean it, keeping only the year and tourist arrivals and removing the unit letters:
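#import and clean previously scraped tourism data
with open('scraped_data.csv', newline='') as f:
    reader = csv.reader(f)
    #skip empty rows (e.g. a blank trailing line) instead of always dropping the last row
    scraped_data = [row for row in reader if row]
#clean data: keep only the text before the first 'm' (year fused with arrivals)
cleaned_data = []
for d in scraped_data:
    cleaned = d[0].split('m',1)[0]
    cleaned_data.append(cleaned.strip())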
print(cleaned_data)
['20202.74', '201919.12', '201818.51', '201717.43', '201616.40', '201515.23', '201415.10', '201315.57', '201214.50', '201113.17', '201011.64', '20099.68', '200810.12', '200710.29', '20069.75', '20058.94', '20048.33', '20036.13', '20027.57', '20017.52', '20007.69', '19996.96', '19986.24', '19977.20', '19967.29', '19957.14']
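Splitting on the first `m` works because the arrivals figure is always followed by " m". As an illustrative sketch only (the pattern and the helper `parse_row` are assumptions, not part of the original notebook), a regular expression can extract the year and the arrivals in one step and fail loudly on a row that does not match:

```python
import re

# A 4-digit year, immediately followed by the arrivals figure and the unit "m".
ROW_PATTERN = re.compile(r"^(\d{4})(\d+(?:\.\d+)?)\s*m")

def parse_row(text):
    """Return (year, arrivals in millions) for one scraped row string."""
    match = ROW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected row format: {text!r}")
    return int(match.group(1)), float(match.group(2))

# Rows copied from the scraped output above.
sample_rows = [
    '20202.74 m',
    '201818.51 m20.42 bn $5.4 %1,103 $',
    '20099.68 m9.23 bn $4.8 %953 $',
]
parsed = [parse_row(row) for row in sample_rows]
print(parsed)
```

This relies on the year always having four digits, which holds for every row shown above.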
As each entry still fuses the year with the arrival figure, separate the two and convert the result to a dataframe. Then check for out-of-bounds values:
#separate years and arrival numbers
import pandas as pd
arrivals = []
for data in cleaned_data:
    #first 4 characters are the year, the remainder is arrivals in millions
    year_arrival = []
    year_arrival.append(data[:4])
    year_arrival.append(data[4:])
    arrivals.append(year_arrival)
#convert to dataframe
arrivals_df = pd.DataFrame(arrivals, columns =['Year', 'Arrivals(millions)'])
#limit to 5 rows
arrivals_df.head(5)
| | Year | Arrivals(millions) |
|---|---|---|
| 0 | 2020 | 2.74 |
| 1 | 2019 | 19.12 |
| 2 | 2018 | 18.51 |
| 3 | 2017 | 17.43 |
| 4 | 2016 | 16.40 |
#check for out of bounds values
arrivals_df.describe().round(1)
| | Year | Arrivals(millions) |
|---|---|---|
| count | 26 | 26 |
| unique | 26 | 26 |
| top | 2005 | 2.74 |
| freq | 1 | 1 |
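Because both columns hold text at this point, `describe()` reports only count, unique, top and freq, which cannot reveal an out-of-range number. A minimal sketch of a numeric check follows, using three rows from the cleaned output as sample data. The bounds (years 1995 to 2020, arrivals above zero) are assumptions chosen to match the scraped range, and `numeric_df` is a hypothetical name.

```python
import pandas as pd

# Three rows copied from the cleaned data, still stored as text.
sample_df = pd.DataFrame({
    "Year": ["2020", "2019", "2003"],
    "Arrivals(millions)": ["2.74", "19.12", "6.13"],
})

# Convert every column to numbers; anything unparseable becomes NaN.
numeric_df = sample_df.apply(pd.to_numeric, errors="coerce")

# Count values that failed to parse, then test the assumed bounds.
unparsed_count = int(numeric_df.isna().sum().sum())
years_in_range = bool(numeric_df["Year"].between(1995, 2020).all())
arrivals_positive = bool((numeric_df["Arrivals(millions)"] > 0).all())

# With numeric columns, describe() now reports min, max and quartiles.
print(numeric_df.describe().round(1))
```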
Population death rate data were taken from the DOS website (https://tablebuilder.singstat.gov.sg/table/TS/M810141) and downloaded as a csv file. The data contain the overall death rate as well as a breakdown by age group.
Import data from csv file and convert to dataframe.
#import csv
death_rate_df = pd.read_csv('death_rate.csv', skiprows = 10, nrows= 25)
#limit to 5 rows
death_rate_df.head(5)
| | Data Series | 2021 | 2020 | 2019 | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | ... | 1969 | 1968 | 1967 | 1966 | 1965 | 1964 | 1963 | 1962 | 1961 | 1960 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total Age Specific Death Rate | 5.8 | 5.2 | 5 | 5.0 | 5 | 4.8 | 4.8 | 4.7 | 4.6 | ... | 5 | 5.5 | 5.3 | 5.4 | 5.4 | 5.7 | 5.6 | 5.8 | 5.9 | 6.2 |
| 1 | Under 1 Year | 1.8 | 1.8 | 1.7 | 2.1 | 2.2 | 2.4 | 1.7 | 1.8 | 2.0 | ... | 20.9 | 23.4 | 24.8 | 25.8 | 26.3 | 29.9 | 28.1 | 31.2 | 32.3 | 34.9 |
| 2 | 1 - 4 Years | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | ... | 1.3 | 1.7 | 1.5 | 2.1 | 1.9 | 1.9 | 2 | 2.5 | 2.7 | 3.3 |
| 3 | 5 - 9 Years | - | - | 0.1 | 0.1 | - | 0.1 | - | - | 0.1 | ... | 0.5 | 0.5 | 0.6 | 0.6 | 0.6 | 0.7 | 0.7 | 0.8 | 0.9 | 1 |
| 4 | 10 - 14 Years | 0.1 | 0.1 | - | 0.1 | 0.1 | 0.1 | 0.1 | - | 0.1 | ... | 0.5 | 0.5 | 0.5 | 0.5 | 0.6 | 0.6 | 0.6 | 0.6 | 0.7 | 0.7 |
5 rows × 63 columns
Check for out of bounds values:
death_rate_df.describe().round(1)
| | 2018 | 2016 | 2013 | 2012 | 2011 | 2010 | 2009 | 2008 | 2007 | 2006 | 2005 | 2004 | 2003 | 2002 | 2001 | 2000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 26.1 | 26.1 | 28.1 | 27.9 | 28.2 | 29.2 | 29.0 | 29.8 | 31.1 | 31.1 | 31.7 | 31.8 | 34.4 | 32.9 | 33.4 | 35.4 |
| std | 41.3 | 40.5 | 43.8 | 43.0 | 43.2 | 44.9 | 44.2 | 44.4 | 46.1 | 46.2 | 46.9 | 46.4 | 50.3 | 47.7 | 48.4 | 51.0 |
| min | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 25% | 0.4 | 0.3 | 0.4 | 0.4 | 0.4 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 | 0.5 | 0.5 | 0.6 | 0.5 | 0.6 | 0.6 |
| 50% | 3.9 | 4.1 | 4.3 | 4.5 | 4.4 | 4.4 | 4.3 | 4.4 | 4.5 | 4.4 | 4.4 | 4.4 | 4.5 | 4.4 | 4.3 | 4.5 |
| 75% | 39.6 | 41.5 | 43.4 | 42.8 | 43.7 | 44.8 | 45.4 | 47.7 | 50.2 | 48.9 | 49.5 | 51.0 | 55.3 | 52.3 | 52.9 | 56.0 |
| max | 155.1 | 148.5 | 164.7 | 157.2 | 158.2 | 163.6 | 159.2 | 157.6 | 159.2 | 162.7 | 163.3 | 158.3 | 177.9 | 162.1 | 162.4 | 168.8 |
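The summary above covers only some of the year columns, because a column containing the "-" placeholder is read as text and `describe()` then skips it. A hedged sketch of one way to include those years follows. It assumes that "-" can be treated as a missing value, which is an assumption about the DOS notation rather than something the source states, and it uses a small inline sample copied from the rows displayed earlier.

```python
import io
import pandas as pd

# Sample rows copied from the death rate table displayed earlier.
sample_csv = io.StringIO(
    "Data Series,2021,2020,2019,2018\n"
    "Under 1 Year,1.8,1.8,1.7,2.1\n"
    "5 - 9 Years,-,-,0.1,0.1\n"
    "10 - 14 Years,0.1,0.1,-,0.1\n"
)

# Default read: columns containing "-" are not numeric.
raw_df = pd.read_csv(sample_csv)
raw_numeric = list(raw_df.select_dtypes(include="number").columns)

# Assumption: treat a cell that is exactly "-" as missing.
sample_csv.seek(0)
clean_df = pd.read_csv(sample_csv, na_values=["-"])
clean_numeric = list(clean_df.select_dtypes(include="number").columns)

print(raw_numeric)
print(clean_numeric)
```

Only cells that consist solely of "-" are affected, so labels such as "5 - 9 Years" are left intact.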
Now that the main data have been imported into dataframes, they can be processed into the final dataframes for visualisation and restricted to the appropriate time period. The indicators discussed earlier were tourism and the population death rate, and features have to be defined from them before visualisation and analysis can proceed. As the death rate dataset contains both the overall rate and a breakdown by age group, it can be split into two features. The final key features for visualising trends will be:
Tourist arrival rates
Overall population death rate
Death rate by age groups
The defined time period will be year 2000 to year 2021. As the tourist arrivals data only contains up to year 2020, data for 2021 has to be supplemented from DOS(https://tablebuilder.singstat.gov.sg/table/TS/M550001) as a downloaded csv file. However, the supplemented data is in monthly format, thus some transformations have to be done to combine the data for a single year.
#import DOS tourism data from csv file
tourism_rate_df = pd.read_csv('tourism_rate.csv', skiprows = 10, nrows= 1)
tourism_rate_df.head()
| | Data Series | 2022 Apr | 2022 Mar | 2022 Feb | 2022 Jan | 2021 Dec | 2021 Nov | 2021 Oct | 2021 Sep | 2021 Aug | ... | 1978 Oct | 1978 Sep | 1978 Aug | 1978 Jul | 1978 Jun | 1978 May | 1978 Apr | 1978 Mar | 1978 Feb | 1978 Jan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total International Visitor Arrivals By Inboun... | 294304 | 121197 | 67759 | 57167 | 92795 | 41153 | 23979 | 18997 | 15879 | ... | 177639 | 167980 | 201355 | 175968 | 149896 | 162667 | 162400 | 163199 | 147954 | 167016 |
1 rows × 533 columns
Restrict data to months of 2021 and total the numbers:
#restrict data months
new_tourism_rate_df = tourism_rate_df.iloc[:, 5:17]
new_tourism_rate_df.head()
| | 2021 Dec | 2021 Nov | 2021 Oct | 2021 Sep | 2021 Aug | 2021 Jul | 2021 Jun | 2021 May | 2021 Apr | 2021 Mar | 2021 Feb | 2021 Jan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 92795 | 41153 | 23979 | 18997 | 15879 | 18517 | 10029 | 14190 | 25737 | 27202 | 18141 | 23366 |
#sum values and convert to millions unit
year_2021 = new_tourism_rate_df.astype(int).sum(axis = 1, skipna = True)/1000000
print(year_2021)
0    0.329985
dtype: float64
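Selecting columns 5 to 16 by position depends on the file still starting at 2022 Apr. As a sketch under that caveat, the 2021 months can instead be selected by label. The sample frame below is rebuilt from the values displayed above, and the name `cols_2021` is hypothetical.

```python
import pandas as pd

# Values copied from the DOS tourism output shown above (2022 Jan-Apr plus all of 2021).
sample_df = pd.DataFrame([{
    "Data Series": "Total International Visitor Arrivals",
    "2022 Apr": 294304, "2022 Mar": 121197, "2022 Feb": 67759, "2022 Jan": 57167,
    "2021 Dec": 92795, "2021 Nov": 41153, "2021 Oct": 23979, "2021 Sep": 18997,
    "2021 Aug": 15879, "2021 Jul": 18517, "2021 Jun": 10029, "2021 May": 14190,
    "2021 Apr": 25737, "2021 Mar": 27202, "2021 Feb": 18141, "2021 Jan": 23366,
}])

# Pick the columns whose label starts with the year, ignoring stray whitespace.
cols_2021 = [c for c in sample_df.columns if c.strip().startswith("2021")]

# Sum the twelve months and convert to millions.
total_2021 = int(sample_df[cols_2021].iloc[0].sum())
total_2021_millions = total_2021 / 1_000_000
print(total_2021_millions)
```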
Add the value for year 2021 into arrivals dataframe and restrict year range to 2000-2021:
#add year 2021 row
row_2021 = ['2021', round(float(year_2021.iloc[0]), 2)]
arrivals_df.loc[-1] = row_2021
arrivals_df.index = arrivals_df.index + 1
arrivals_df = arrivals_df.sort_index()
arrivals_df = arrivals_df.head(22)
#limit to 5 rows for display
arrivals_df.head()
| | Year | Arrivals(millions) |
|---|---|---|
| 0 | 2021 | 0.33 |
| 1 | 2020 | 2.74 |
| 2 | 2019 | 19.12 |
| 3 | 2018 | 18.51 |
| 4 | 2017 | 17.43 |
The overall population death rate has to be extracted from the existing death_rate dataframe, processed and restricted to the year range 2000-2021:
#extract first row from death_rate dataframe for the overall rate
overall_death_rate_df = death_rate_df.head(1)
#keep 'Data Series' plus the 22 year columns 2021-2000
overall_death_rate_df = overall_death_rate_df.iloc[:, :23]
#transpose so that each year becomes a row
overall_death_rate_df = overall_death_rate_df.transpose()
overall_death_rate_df = overall_death_rate_df.reset_index(level=0)
#promote the first row to column headers, then drop it
overall_death_rate_df= overall_death_rate_df.rename(columns=overall_death_rate_df.iloc[0]).drop(overall_death_rate_df.index[0])
overall_death_rate_df.rename(columns = {'Data Series':'Year', 'Total Age Specific Death Rate':'Overall Death Rate'}, inplace = True)
#limit to 5 rows for display
overall_death_rate_df.head()
| | Year | Overall Death Rate |
|---|---|---|
| 1 | 2021 | 5.8 |
| 2 | 2020 | 5.2 |
| 3 | 2019 | 5 |
| 4 | 2018 | 5 |
| 5 | 2017 | 5 |
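The transpose, reset and rename chain above reshapes one wide row into a two-column table. An alternative sketch using `DataFrame.melt` is shown here for comparison. It is not the method used in this report, and the sample frame holds only three of the values displayed above.

```python
import pandas as pd

# One wide row, as in the first row of the death rate data (three years only).
wide_df = pd.DataFrame({
    "Data Series": ["Total Age Specific Death Rate"],
    "2021": [5.8],
    "2020": [5.2],
    "2019": [5.0],
})

# melt turns each year column into its own row.
long_df = (
    wide_df
    .melt(id_vars="Data Series", var_name="Year", value_name="Overall Death Rate")
    .drop(columns="Data Series")
)

# Store years as integers so they sort and plot as numbers.
long_df["Year"] = long_df["Year"].astype(int)
print(long_df)
```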
Remove the first row from the death_rate dataframe so that only the age groups remain, then process and restrict to the year range 2000-2021:
#remove first row (overall rate) from death_rate dataframe
death_rate_df = death_rate_df.iloc[1: , :]
#keep the 19 age groups from 'Under 1 Year' to '85 - 89 Years'
death_rate_df = death_rate_df.iloc[:19 , :]
#keep 'Data Series' plus the 22 year columns 2021-2000
death_rate_df = death_rate_df.iloc[: , 0:23]
#transpose so that years become the index and age groups the columns
death_rate_df = death_rate_df.transpose()
#promote the first row (age group names) to column headers, then drop it
death_rate_df = death_rate_df.rename(columns=death_rate_df.iloc[0]).drop(death_rate_df.index[0])
#limit to 5 rows for display
death_rate_df.head()
| | Under 1 Year | 1 - 4 Years | 5 - 9 Years | 10 - 14 Years | 15 - 19 Years | 20 - 24 Years | 25 - 29 Years | 30 - 34 Years | 35 - 39 Years | 40 - 44 Years | 45 - 49 Years | 50 - 54 Years | 55 - 59 Years | 60 - 64 Years | 65 - 69 Years | 70 - 74 Years | 75 - 79 Years | 80 - 84 Years | 85 - 89 Years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021 | 1.8 | 0.1 | - | 0.1 | 0.3 | 0.2 | 0.3 | 0.3 | 0.5 | 0.8 | 1.5 | 2.3 | 3.8 | 6.4 | 9.7 | 15.8 | 28.2 | 49.8 | 93.4 |
| 2020 | 1.8 | 0.1 | - | 0.1 | 0.2 | 0.2 | 0.3 | 0.3 | 0.5 | 0.8 | 1.3 | 2.2 | 3.7 | 5.9 | 9.2 | 15.1 | 26.1 | 49.1 | 83.7 |
| 2019 | 1.7 | 0.1 | 0.1 | - | 0.2 | 0.3 | 0.2 | 0.4 | 0.5 | 0.8 | 1.3 | 2.1 | 3.7 | 5.9 | 9 | 15.3 | 27.2 | 49.2 | 84.5 |
| 2018 | 2.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.3 | 0.3 | 0.4 | 0.5 | 0.7 | 1.4 | 2.3 | 3.9 | 6.3 | 9.5 | 16.5 | 28.3 | 51.4 | 84.6 |
| 2017 | 2.2 | 0.1 | - | 0.1 | 0.2 | 0.2 | 0.3 | 0.4 | 0.5 | 0.8 | 1.4 | 2.6 | 3.9 | 6.3 | 9.9 | 16.9 | 28.9 | 53.7 | 90.6 |
Now 3 dataframes have been prepared for visualisation. Each dataframe will be plotted and analysed at 3 time periods, 2003-2004 for SARS, 2009 for H1N1 and 2019 onwards for COVID-19. This section will strive to answer the questions set out in the aims and objectives while elaborating on limitations and constraints.
Below is the histogram plot for tourist arrival rates:
#plot arrivals
import plotly.express as px
px.histogram(data_frame = arrivals_df,
x = 'Year',
y = 'Arrivals(millions)'
)
During 2003-2004 (SARS), tourist arrivals dipped in 2003 compared to previous years. By 2004 arrivals had recovered, even exceeding the years before 2003. In 2009 (H1N1) there was only a slight dip. The most significant change came in the year after COVID-19 emerged, with a roughly 7-fold decrease in tourist arrivals in 2020 (from 19.12 million in 2019 to 2.74 million). An even steeper decrease in 2021 reduced the number to the lowest in 22 years. The combined arrivals of 2020 and 2021 were lower than those of any other single year.
Based on the above observations, all three pandemics impacted tourist arrivals negatively, to varying degrees. 2009 (H1N1) had the least impact, while 2019 onwards (COVID-19) was the hardest hitting.
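The size of each dip can be quantified from the scraped figures. The following is a small illustrative calculation, not part of the original notebook. The values are copied from the WorldData output above, and `pct_change` is a hypothetical helper.

```python
# Arrivals in millions, copied from the scraped WorldData output.
arrivals = {2002: 7.57, 2003: 6.13, 2008: 10.12, 2009: 9.68, 2019: 19.12, 2020: 2.74}

def pct_change(previous, current):
    """Percentage change from the previous year to the current year."""
    return (current - previous) / previous * 100

# Year-on-year change in the first year of each outbreak.
sars_change = pct_change(arrivals[2002], arrivals[2003])
h1n1_change = pct_change(arrivals[2008], arrivals[2009])
covid_change = pct_change(arrivals[2019], arrivals[2020])

# How many times smaller 2020 arrivals were than 2019 arrivals.
covid_fold = arrivals[2019] / arrivals[2020]

print(round(sars_change, 1), round(h1n1_change, 1), round(covid_change, 1))
print(round(covid_fold, 1))
```

On these figures the 2003 drop is about 19%, the 2009 drop about 4% and the 2020 drop about 86%.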
Below is the histogram plot for overall population death rate:
#plot overall death rate
import plotly.express as px
px.histogram(data_frame = overall_death_rate_df,
x = 'Year',
y = 'Overall Death Rate'
)
In 2003 (SARS) the death rate rose slightly but returned to normal by 2004. Surprisingly, 2009 (H1N1) saw a slight dip in the rate despite being a pandemic year. For 2019 onwards (COVID-19), 2020 saw a slight increase, while 2021 surpassed all the previous years with a large jump in the rate.
Based on the above observations, all three pandemics coincided with a change in the population death rate. The rate increased in 2003 (SARS) and from 2019 onwards (COVID-19), with COVID-19 showing the sharpest rise. In 2009 (H1N1) the rate decreased slightly.
Below is the graph plot for death rate by age groups:
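#plot death rate by age groups
import plotly.graph_objs as go
#transpose so that age groups lie on the x-axis and each year is one column
df = death_rate_df.T
fig = go.Figure()
#add one line per year; traces are numbered in the order added (trace 0 is 2021)
for col in df.columns:
    fig.add_trace(go.Scatter(x=df.index, y=df[col]))
#values are stored as strings, so let plotly convert them to numbers
fig.update_layout(autotypenumbers='convert types')
fig.show("notebook")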
In this graph, each trace represents one year. Traces are numbered in the order they are added, so trace 0 is 2021 and trace 21 is 2000. The death rate rises roughly exponentially with age. There are no observable outliers. For 2003-2004 (SARS) and 2009 (H1N1) there seems to be little variation from the surrounding years. From 2020 to 2021 (COVID-19) the rate rose in every age group from 45 - 49 Years upwards, for example from 83.7 to 93.4 in the 85 - 89 Years group. However, 2021 does not hold the highest rate in every age group: in the rows displayed above, the Under 1 Year rate was 2.2 in 2017 against 1.8 in 2021.
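Reading the peak year off overlapping lines is error-prone, so the comparison can also be made numerically. The sketch below uses only the five years and three of the age groups displayed in the table above, so it says nothing about 2000-2016. The frame `sample_rates` is a hypothetical stand-in for the full dataframe.

```python
import pandas as pd

# Rates copied from the age-group table above (rows are years).
sample_rates = pd.DataFrame(
    {
        "Under 1 Year": [1.8, 1.8, 1.7, 2.1, 2.2],
        "65 - 69 Years": [9.7, 9.2, 9.0, 9.5, 9.9],
        "85 - 89 Years": [93.4, 83.7, 84.5, 84.6, 90.6],
    },
    index=[2021, 2020, 2019, 2018, 2017],
)

# For each age group, find the year with the highest rate.
peak_year = sample_rates.idxmax().to_dict()

# Change from 2020 to 2021 in each age group.
change_2021 = (sample_rates.loc[2021] - sample_rates.loc[2020]).round(1).to_dict()

print(peak_year)
print(change_2021)
```

Within this sample, 2021 is the peak only for the 85 - 89 Years group, while 2017 is the peak for the other two.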
The general trend from the above observations is that the oldest age group had the highest death rate. This is expected, as people generally have weaker immune systems as they age.
COVID-19 was observed to have the largest impact on both tourism and the overall death rate, exceeding SARS and H1N1 quantitatively.
While the death rate may be a general indicator of how well a pandemic has been managed, complex underlying factors may also influence it. These include how infectious and how deadly the disease is, and the resources of the healthcare system. Such influences limit the insights about pandemic management that can be drawn from the death rate alone.
From the analysis and findings, SARS and H1N1 had a moderate to minimal effect on Singapore's tourism and death rate. Singapore seems to have responded to both pandemics relatively well, recovering quickly in the subsequent year. COVID-19 seems to be an outlier compared to the other pandemics, severely impacting tourism and the death rate. However, as medical science has progressed continually over the two decades, the difference may be explained by the nature of the disease rather than by the quality of Singapore's response to COVID-19. A 2021 study remarked that Singapore demonstrated a strong capacity to identify, trace and document COVID-19 cases (Menkir et al., 2021).
Alsunaidi, S., Almuhaideb, A., Ibrahim, N., Shaikh, F., Alqudaihi, K., Alhaidari, F., Khan, I., Aslam, N. and Alshahrani, M., 2021. Applications of Big Data Analytics to Control COVID-19 Pandemic. Sensors, 21(7), p.2282.
Menkir, T., Chin, T., Hay, J., Surface, E., De Salazar, P., Buckee, C., Watts, A., Khan, K., Sherbo, R., Yan, A., Mina, M., Lipsitch, M. and Niehus, R., 2021. Estimating internationally imported cases during the early COVID-19 pandemic. Nature Communications, 12(1).